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Abstract 

Background: Recurrent gene duplication and retention played an important role in angiosperm genome evolution. 
It has been hypothesized that these processes contribute significantly to plant adaptation but so far this hypothesis 
has not been tested at the genome scale. 

Results: We studied available sequenced angiosperm genomes to assess the frequency of positive selection 
footprints in lineage specific expanded (LSE) gene families compared to single-copy genes using a c/n/c/s- based test 
in a phylogenetic framework. We found 5.38% of alignments in LSE genes with codons under positive selection. In 
contrast, we found no evidence for codons under positive selection in the single-copy reference set. An analysis at 
the branch level shows that purifying selection acted more strongly on single-copy genes than on LSE gene 
clusters. Moreover we detect significantly more branches indicating evolution under positive selection and/or 
relaxed constraint in LSE genes than in single-copy genes. 

Conclusions: In this - to our knowledge -first genome-scale study we provide strong empirical support for the 
hypothesis that LSE genes fuel adaptation in angiosperms. Our conservative approach for detecting selection 
footprints as well as our results can be of interest for further studies on (plant) gene family evolution. 

Keywords: Lineage specific expansion (LSE), Gene duplication. Gene retention, Ultraparalogs (UP), Superorthologs (SO), 
Comparative genomics. Positive selection. Adaptation 



Background 

Duplicated genes have been suggested to be the raw ma- 
terial for the evolution of new functions and important 
players in adaptive evolution [1]. Genomes are constantly 
subject to rearrangements, by both whole genome dupli- 
cation (WGD) and small-scale genome duplication (SSD), 
where tandemly duplicated genes (TDG) are a common 
case of SSD which generate clusters of physically linked 
genes. The genomes of angiosperms (flowering plants) are 
of particular interest to study the impact of gene duplica- 
tion. Compared to mammals and even to most other plant 
genomes, angiosperms undergo WGDs, recombination, 
and retrotransposition more frequently; as a consequence, 
they also display a larger range of genome sizes and chromo- 
some numbers [2,3]. Most angiosperm genomes sequenced 
so far show evidence for at least one (but usually more) 
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WGD event during their evolution (see e.g. [4-7]). The im- 
portance of TDGs has also been shown in Oryza sativa 
(rice) and Arabidopsis thaliana where TDGs comprise 15- 
20% of all coding genes [8-10]. Using genomic and expres- 
sion data in plants, Hanada et al. [11] showed that TDGs 
tend to be involved in response to environmental stimuli 
and are enriched in genes up-regulated under biotic stress. 
This suggests that TDGs play an important role in adapta- 
tion of plants to changing environments [11-13]. Taken 
together, these findings demonstrate the dynamic nature 
of angiosperm genomes and raise the question of the im- 
pact of gene duplications on plant adaptation. 

Gene duplication creates an unstable state of functional 
redundancy, which in most cases will disappear by loss of 
one copy through accumulation of degenerative muta- 
tions, recombination and/or genetic drift. But sometimes 
both copies are long-term preserved due to functional 
changes reducing their redundancy and making the loss of 
one copy disadvantageous [14]. Although the respective 
roles of adaptive versus non-adaptive processes in the 
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maintenance of gene duplicates have been much debated 
(for general reviews see [15-18]), gene duplication should 
increase the occurrence of adaptation for several reasons. 
First, it can allow the fixation of beneficial mutations on 
one copy, leading to neofunctionalization, while the other 
copy ensures the ancestral function [16,19]. Second, it can 
free the genome from an "adaptive conflict" if the different 
functions of an ancestral (single) gene cannot be improved 
independently [20-22]. Third, even when adaptation is not 
involved in the initial conservation of duplicates, the pres- 
ence of two (or more) copies is expected to increase the 
adaptation rate under certain conditions. Duplication in- 
creases the number of gene copies, hence the rate of ap- 
pearance of beneficial mutations. Otto & Whitton [23] 
showed that if beneficial mutations are dominant or partly 
dominant, the rate of adaptation should increase with 
copy number (or ploidy level). If concerted evolution 
among gene copies is taken into account, Mano & Innan 
[24] showed that gene conversion {Le, exchange of genetic 
material between duplicates in a copy and paste manner) 
increases the effective population size of gene families 
proportionally to the number of gene members, thus in- 
creasing the efficacy of weak selection. Their model pre- 
dicts that the rate of adaptive substitutions increases with 
the number of gene copies. Overall we thus expect higher 
rates of adaptive evolution in multigene families than in 
single-copy genes. 

As a result of the complex histories of duplicated 
genes, the retention rate {Le, the proportion of duplicated 
genes that are maintained in genomes) varies according to 
several factors including time since the duplication event. 



protein function, or duplication mode [10]. These varia- 
tions in retention rates have direct consequences on gene 
family organization and evolution. Reconciliation methods 
exploit the observed discrepancies between gene family 
trees and species trees to infer gene duplication, gene 
transfer, and gene loss (see [25] for an overview). Among 
other things, reconciliation methods can be used to esti- 
mate duplication or transfer rates and to predict sequence 
orthology (=sequences related by speciation) [26,27]. 
Using this method, the extreme heterogeneity of duplica- 
tion/retention rates among taxa and gene families and/or 
subfamilies was demonstrated {e,g, [28-32]). In particular, 
reconciliation allows for identification of cases in which 
recurrent events of duplications (followed by retention) 
are specific of some lineages and create clades of paralogs 
{Le, sequences related by duplication) in phylogenetic trees 
(Figure la). Note that since only retained duplications are 
observable, it is hard to estimate duplication and retention 
rates independently; hence our use of the "duplication/re- 
tention rate" terminology. 

Lineage specific duplications /retentions are of particular 
interest because the recurrence of such events in the same 
lineage and in a short period of evolutionary time raises 
the question of their adaptive role to an even greater ex- 
tent. To test the hypothesis that lineage specific expansion 
(LSE) of gene families enhances adaptation we compared 
positive (Darwinian) selection footprints in lineages con- 
taining recent and specific duplicated genes to reference 
lineages containing only single-copy genes. One way to de- 
tect positive selection is by analyzing nucleotide substitution 
patterns at the codon level in a phylogenetic framework. 
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Figure 1 Workflow overview, (a) Example of a protein tree as it can be found on the GreenPhylDB. The Arabidopsis sequences 7-12 (cluster 
B) are only related by duplication (nodes with red boxes) and are therefore ultraparalogs (=UP; red lines). Sequences only related by speciation 
are superorthologs (=S0; blue lines). Those are sequences 1 - 6 (cluster A), 13 and 14 (cluster C), 15 - 17 (cluster D). For example, sequences 13 
and 15 are paralogs (as they are related by duplication) but not ultraparalogs (as a speciation event occurred after duplication). We used only 
clusters containing at least six sequences (clusters A and B, bold) for further analysis. The dashed lines indicate that SO and UP clusters can come 
from the same GreenPhyl tree (SOI and UPl datasets) or from separate trees (S02 and UP2 datasets). (b) Corresponding CDS sequences were 
downloaded for each cluster. The clusters were aligned using PRANK [71] and cleaned with GUIDANCE [42]. (c) For all alignments, phylogenetic 
trees were created using PhyML [76]. (d) Positive selection was inferred on codons using PAML's codemi [74] and on braches using mapNH 
[78,79] in all alignments. 



Fischer et at. BMC Plant Biology 2014, 14:1 51 
http://www.bionnedcentral.conn/1471-2229/14/151 



Page 3 of 1 5 



Nucleotide substitutions can either be nonsynonymous 
{Le, protein changing, thereby potentially impacting the 
fitness) or synonymous {Le, not protein changing, thereby 
theoretically without consequences for the fitness). The 
nonsynonymous/synonymous substitution rate ratio, de- 
noted as d^lds or co, can be used to infer the direction 
and strength of natural selection. If no selection is acting, 
(0 should equal 1. An co value smaller than 1 indicates an 
under-representation of nonsynonymous substitutions, 
which can be interpreted as the preferential elimination of 
deleterious mutations by purifying selection. The closer co 
is to zero, the stronger purifying selection is acting. On 
the other hand, if co is larger than 1 it indicates an over- 
representation of nonsynonymous substitutions, which 
can be interpreted as positive selection on new variants. 
Using such an approach, positive selection has been de- 
tected for MADS-box transcription factors [33], monosac- 
charide transporters [34], genes involved in a triterpene 
pathway [35], an anthocyanin pathway enzyme encoding 
gene [36], and epimerase genes [37] to mention only a few 
examples in plants. So far, this approach has mostly been 
applied to single candidate gene families. Thanks to the 
availability of numerous completely sequenced plant ge- 
nomes, it can now be used at the genome level for several 
angiosperm species. 

The dynamic nature of angiosperm genomes makes 
them an ideal system to study the link between gene du- 
plication/retention rate heterogeneity and adaptation. 
Assuming that adaptation is acting when positive selec- 
tion footprints are detected, we want to test if positive 
selection can be observed more frequently in LSE genes 
compared to single-copy genes. We applied a d^/ds- 
based test to detect positive selection as it is easy to use 
on a large scale, it is one of the most stringent tests 
[38-40], and it has been applied successfully in many 
similar cases (for examples, see above). Using this ap- 
proach, we found 5.38% of codons under positive selec- 
tion in LSE gene families but none in single-copy ones. 
In addition, the average co over branches of LSE gene 
trees is almost twice as high as that observed in single- 
copy gene trees. We also found a much higher propor- 
tion of branches under positive selection and/or relaxed 
constraint among LSE gene trees than among single- 
copy gene trees. Taken together, these results strongly 
support the prediction that (at least in angiosperm ge- 
nomes) LSE gene evolution plays an important role in 
adaptation whereas very few single-copy genes seem to 
be involved. 

Results 

Dataset description 

We investigated whole genomes of five monocots (Musa 
acuminata, O. sativa, Brachypodium distachyon, Zea 
maysy Sorghum bicolor) and five dicots {Vitis vinifera, A. 



thaliana, Populus trichocarpa, Glycine max, Medicago 
truncatula). From the GreenPhyl database [41] we ex- 
tracted ultraparalog clusters (UP - sequences only related 
by duplication) which represent our LSE gene set. As a 
single-copy gene reference, we chose a superortholog gene 
set (SO - sequences only related by speciation). To ad- 
dress the question of whether or not positive selection is 
more frequent during LSE events, we compared the re- 
sults obtained on UPs with those obtained on SO gene 
sets. The SO gene set was then divided in two subsets. 
The first one, SOI, contains SO genes extracted from 
GreenPhyl protein trees in which at least one UP cluster 
was also identified. This means that all the trees from 
which an SOI was extracted contain at least one UP clus- 
ter. The second SO set (S02) is the complement of SOI, 
Le, it is composed of SO genes extracted from GreenPhyl 
trees in which no UP clusters were found. Likewise, the 
UPl dataset represents UP clusters extracted from Green- 
Phyl trees also containing SO clusters and the UP2 dataset 
represents UP clusters from GreenPhyl trees from which 
no SO clusters were extracted. We subdivided the dataset 
as we expected a "family effect". This effect may be caused 
by an accelerated evolutionary rate in some families which 
are more prone to gene duplication and/or retention than 
others, e,g due to their function or base composition. If 
one GreenPhyl tree contained more than one SO or UP 
cluster, we kept only one cluster randomly (see Methods 
for details). A detailed overview of the workflow can be 
found in Figure 1. 

Our final dataset for codeml analysis comprised 160 
UPl, 1,512 UP2, 167 SOI, and 1,203 S02 clusters (Table 1). 
The mapNH analysis was performed on 154 UPl, 1,435 
UP2, 167 SOI, and 1,203 S02 clusters (Table 1) and 1,257 
UPl, 14,326 UP2, 1,807 SOI, and 13,374 S02 branches 
(Table 1). The median length of the UPl alignments is 
1,272 bp (base pairs), 1,220 bp for the UP2, 1,230 bp for 
SOI, and 987 bp for S02 alignments (Table 1, Figure 2). 
The UP alignments are significantly longer than the SO 
alignments (Mann- Whitney test: 0.001). This can be 
partially explained by the fact that GUIDANCE introduces 
gaps instead of aligning ambiguous sites [42]. Therefore, 
UP genes - which are frequently under less selective con- 
straint - may produce longer alignments due to the intro- 
duction of gaps. The median number of sequences in an 
alignment {Le, median cluster size) is 7 for UP and SO 
alignments (Table 1, Figure 2). We found that the cluster 
sizes for the SO datasets are significantly smaller than for 
the UP datasets (Mann- Whitney test: p<0,00l) which 
was expected because the number of sequences a super- 
ortholog cluster can contain is at most ten (=number of 
species used in this study) whereas it is not bounded for 
UP clusters. 

As this divergence time between one species and its 
closest relative increases, one might expect that the 



Fischer et at. BMC Plant Biology 2014, 14:1 51 
httpy/www.biomedcentral.com/l 471-2229/1 4/1 5 1 



Page 4 of 1 5 



Table 1 General dataset description 





UP1 


UP2 


UPps 


SOI 


S02 


Clusters for final codemi site model analysis 


160 


1,512 


90 


167 


1,203 


Clusters for final mapNH analysis 


154 


1,435 


90 


167 


1,203 


Total number of branches 


1,881 


22,475 


1,730 


1,817 


13,537 


Number of analysed branches by mapNH 


1,257 


14,326 


1,298 


1,807 


13,374 


Median cluster size Qu; 3'"^ Qu) 


7 (6; 8) 


7 (6; 10) 


8 (6; 13) 


7 (6; 8) 


7 (6; 8) 


Median alignment length (1'^ Qu; 3'"^ Qu) [bp] 


1,272 (792; 1,858) 


1,220 (753; 1,851) 


1,314 (864; 1,942) 


1,230 (900; 1,737) 


987 (651; 1,470) 


Total number of branches 


1,881 


22,475 


1,730 


1,817 


13,537 


Total number of sites 


42,706 


355,486 


21,864 


59,191 


340,556 



Qu quantile; bp base pairs. 

number of detected UPs could also increase when com- 
pared to a distantly related species than to a closely re- 
lated one. Therefore, we tested if the divergence time 
and the number of identified UP clusters correlated. 
Note that we always used the divergence time relative to 
the most closely related species in the GreenPhyl data- 
base, no matter if we analysed this species later (diver- 
gence times can be found in Additional file 1: Figure 
SI). Regression analysis shows that there is no significant 
positive correlation between the divergence time and the 
number of detected clusters: Spearman non-parametric 
correlation coefficient (p) = -0.171, p = 0.626 (Figure 3). 
The correlation remains not significant after removing 
M, trunculata {p = -0.227, p = 0.557). The most likely 
explanation for this lack of correlation is the equilibrium 
between gene duplication and loss over time. The birth/ 
death rate has been shown to be relatively constant over 



time and therefore the frequency of gene copies in a 
genome declines exponentially with age [14]. 

Positive selection at the codon level 

The average number of UP clusters used in the final ana- 
lysis is around 150 clusters per species, with Brachypo- 
dium distachyon showing a very low (63) and Medicago 
truncatula showing a very high (400) number of clusters 
(Table 2). On average, 12.86% and 5.38% of UP clusters 
show evidence for positive selection before and after man- 
ual curation, respectively (Table 2). This discrepancy 
shows how important manual curation for alignment er- 
rors is as we discovered around 50% of alignments with a 
possible false positive signal. As we were very strict during 
the manual curation process, the clusters remaining can 
be considered as true positives but we might have re- 
moved some other true positives. There is no significant 
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Figure 2 Alignment length against cluster size. Each dot in the scatter plot represents (a) an ultraparalog or (b) a superortholog alignment. 
The histogram above the scatter plot represents the count of alignments for each cluster size; the histogram right to the scatter plot represents 
the counts of alignments for each alignment length, bp: base pairs. 
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Figure 3 Number of detected UP clusters for every species against divergence time. No significant correlation was observed (p: -0.1 71, p = 0.626). 



difference between the number of UPl and UP2 clusters 
under selection although we detected less - sometimes 
zero - clusters with codons under selection in the UPl 
dataset, most likely because of a small sample size in this 
dataset (160 clusters vs. 1,512 UP2 clusters). Interestingly, 
no SOI or S02 cluster seems to have evolved under 



positive selection (Table 2). We also defined a new sub- 
category of clusters denoted UPps that contains the 90 UP 
clusters for which positive selected sites were detected 
and manually validated (Table 1). The UPps clusters have 
a longer median length (1,314 bp) and larger median clus- 
ter size (8) than the other UP and SO clusters (Table 1). 



Table 2 Clusters containing codons under positive selection before and after manual curation 

Clusters used in final Clusters under selection before manual Clusters under selection after manual 

analysis curation (%) curation (%) 



Species 


UPl 




UP2 


UPl 




UP2 


UPl 




UP2 


M. acuminata 


36 




107 


1 (2.78) 




6 (5.61) 


0 (0.00) 




4 (3.74) 


0. sativa 


7 




145 


1 (14.29) 




29 (20.00) 


1 (14.29) 




1 1 (7.59) 


B. distachyon 


4 




59 


0 (0.00) 




14 (23.73) 


0 (0.00) 




2 (3.39) 


Z. mays 


24 




226 


4 (16.67) 




32 (14.16) 


0 (0.00) 




9 (3.98) 


5. bicolor 


4 




93 


0 (0.00) 




9 (9.68) 


0 (0.00) 




4 (4.30) 


V. vinifera 


9 




114 


1 (11.11) 




10 (8.77) 


0 (0.00) 




3 (2.63) 


A. thaiiana 


13 




138 


0 (0.00) 




25 (18.12) 


0 (0.00) 




14 (10.14) 


P. trichocarpa 


16 




132 


3 (18.75) 




18 (13.64) 


1 (6.25) 




12 (9.09) 


G. max 


17 




128 


3 (17.65) 




5 (3.91) 


3 (17.65) 




1 (0.78) 


M. truncatula 


30 




370 


5 (16.67) 




49 (13.24) 


4 (13.33) 




21 (5.68) 


Sum/average 


160 




1,512 


18 (11.25) 




197 (13.03) 


9 (5.63) 




81 (5.36) 


UPall 




1,672 






215 (12.86) 






90 (5.38) 




SOI 




167 






1 (0.60) 






0 (0.00) 




S02 




1,203 






3 (0.25) 






0 (0.00) 
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(0 at the branch level 

The analysis of selective pressures at the branch level was 
performed using mapNH on the same dataset as the 
codon analysis. If co at a branch is larger than 1.2 we con- 
sider this a strong indicator of positive selection (simply 
defining co > 1 as an indicator of positive selection might 
lead to false positives as in a neutral scenario co rather 
fluctuates around 1 than being exactly 1). The mean co of 
the branches is significantly {p < 0.001) higher in UP2 
(0.62) than in S02 (0.29) and the distribution shows a lar- 
ger variance for UP2 than for S02 (Figure 4, Table 3). As 
compared to S02, in UP2 we observe: (i) a higher propor- 
tion of branches with co > 1.2 (8.78%, compared to 0.22% 
for S02), (ii) higher co values for branches with co > 1.2 
(1.80, compared to 1.64 for S02), and (iii) higher co values 
for branches with (o< 1 (0.49 compared to 0.29 for S02; 
Table 3). This indicates a relaxation of purifying selection 
for UP2 in contrast to S02 but also a higher frequency of 
branches harboring an accelerated evolution rate. Similar 
results are observed on the UP and SO clusters extracted 
from the same trees {i.e. UPl and SOI). Mean co is signifi- 
cantly (p< 0.001) higher for UPl (0.51) than for SOI 
(0.28; Table 3). Interestingly, the mean co for UPl and UP2 
differ significantly {p < 0.001; Table 3, Figure 4), indicating 
the family effect mentioned before. For the UPps clusters, 
the mean co (0.84), the proportion of branches with co > 1.2 
(15.79%), and the mean co of branches with co > 1.2 (1.95) 
are higher compared to the UPl and UP2 clusters (Table 3, 
Figure 4). 

Effect of cluster size and length 

The UP clusters are longer and contain more sequences 
than the SO clusters (see above). This could lead to an 
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Figure 4 Distribution of o) of branches in different subsets. 
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underestimation of codons under selection in SO clusters 
as codeml has more power to detect footprints of positive 
selection in longer /larger alignments [39]. A general linear 
model analysis showed that differences in alignment 
length cannot explain the detected differences between 
UP and SO clusters if cluster size (=number of sequences 
in alignment) is < 10 (data not shown). Cluster size, how- 
ever, had an effect. In order to test the reliability of our re- 
sults relative to the number of sequences, we performed 
Fisher's exact tests to see if we could find either signifi- 
cantly more clusters, codons, and/or branches under se- 
lection for UP than for SO cluster in each cluster size 
category (up to 10 as this is the maximum for SO clus- 
ters). We find significantly more clusters under positive 
selection for UP clusters for the size categories 6 and 7 
(Table 4). For the other size categories we lack power to 
detect significant differences (Table 4). We also detect sig- 
nificantly more codons showing footprints of selection in 
UP clusters for the size categories 6-9 (Table 4). In 
addition, branches with co > 1.2 are significantly more fre- 
quent in UP clusters for size categories 5-10 (Table 4). To 
summarize, UP clusters still show more signatures of posi- 
tive selection more frequently after controlling for cluster 
size effect. 

Effect of evolutionary time and polymorphism 

To see if our results are biased by divergence discrepan- 
cies between UP and SO, we sorted the co value of each 
branch by their synonymous substitution rate {ds). To rule 
out the effect of polymorphism, we excluded ("young") ex- 
ternal branches from the dataset and compared the 
remaining ("old") internal branches (UPint) to the SO 
dataset. We found a significant difference between the co 
of SO and UPint in ds intervals ranging from 0.01 to 0.21 
(Figure 5a+b). There is no significant difference in the first 
ds interval (Figure 5a), most likely because of residual 
polymorphism and/or a low mutation rate in SO and 
UP clusters. This interval harbors, however, more than 
50% of the dataset. These results indicate that - except 
for very low ds values - the difference between SO and 
UP cluster cannot be explained solely by divergence dis- 
crepancies or residual polymorphism. Above ds values of 
0.21 the Mann- Whitney test is inconclusive (Figure 5b) 
due to lack of power. 

Annotation of clusters under selection 

The GreenPhylDB provides details on predicted molecular 
function, biological process, cellular component, and fam- 
ily and domain annotation for each cluster. We extracted 
those details for clusters found to have evolved under posi- 
tive selection using codemls site model. Additional file 2 
provides the details of the annotations for all the clusters 
with codons under selection before excluding clusters 
which derive from the same GreenPhyl tree (see Methods). 
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Table 3 Results of the branch analysis with mapNH 





UP1 


UP2 


UPps 


SOI 


S02 


Number of analysed branches 


1,257 


14,326 


1,298 


1,807 


13,374 


Branches with oo < 1 {%f 


1,144 (91.01) 


12,515 (87.36) 


993 (76.50) 


1,799 (99.56) 


13,329 (99.66) 


Mean oo for branches with o) < 1 


0.41 


0.49 


0.59 


0.28 


0.29 


Branches with o) > 1 {%f 


113 (8.99) 


1,811 (12.64) 


305 (23.50) 


8 (0.44) 


45 (0.34) 


Mean oo for branches with oo > 1 


1.55 


1.52 


1.67 


1.37 


1.44 


Branches with oo > 1 .2 {%f 


73 (5.81) 


1,099 (8.78) 


205 (15.79) 


4 (0.22) 


23 (0.17) 


Mean oo for branches with oo > 1 .2 


1.81 


1.80 


1.95 


1.64 


1.79 


Mean oo±SE 


0.51 ±0.44 


0.62 ± 0.47 


0.84 ± 0.65 


0.28 ±0.1 7 


0.29 ±0.1 7 



^of analysed branches. 

Annotation is an ongoing process on the GreenPhylDB; 
therefore most of the clusters are not annotated - especially 
in monocots. There seems to be no trend in tree size or 
species specificity as clusters shown to have codons under 
selection can both be found in large trees containing se- 
quences from various plant species and from small species 
specific trees (Additional file 2). As annotation is ongoing 
and remains under constant modification, a comprehen- 
sive analysis of the potential function of the clusters with 
codons under selection would not lead to reliable results. 
However, some trends can be observed: (i) the most abun- 
dant molecular function is "protein binding" (21.57% of all 
annotated molecular functions in the dataset) followed by 
"transferase activity" (9.80%). This is especially true in the 
Level 2 dataset {Le, clusters derived from large GreenPhyl 
trees) whereas potential molecular functions seem to be 
more diverse in the Level 1 dataset (Additional file 2). (ii) 
The most common predicted biological functions are 
"metabolic process" (23.53% of all annotated biological 
processes in the dataset) and "oxidation-reduction process" 
(20.59%). "Defense" (14.71%) is also dominant, but only 
in the Level 2 dataset (Additional file 2). (iii) If domains 
are annotated to the clusters with codons under selection. 



F-box (22.54% of all annotated domains in the dataset). 
Leucine rich repeats (LRR; 11.27%), and NB-ARCs (8.45%) 
are predominant. Again, this trend is mostly visible in 
the Level 2 dataset whereas potential domains are more 
diverse in the Level 1 dataset (Additional file 2). 

Discussion 

The important role of duplicated genes in plant adaptation 
has been argued theoretically (reviewed by [43]). To assess 
whether lineage specific expanded (LSE) genes show more 
evidence for positive selection than single-copy genes we 
analyzed LSE gene families from ten angiosperm genomes 
using a <iN/<is -based test. We found positive selection 
footprints moderately frequently at the codon level in LSE 
genes (5.38% in average among the different species) but 
did not find any positive selection footprints on single- 
copy genes after manual curation. The number of codons 
under positive selection is also found higher in LSE than 
in single copy genes for different cluster size categories 
and thus cannot be explained solely by a difference of 
power to detect positive selection between the two data- 
sets. Positive selection is also detected in LSE genes at 
the branch level and we found a significantly higher 



Table 4 Results of Fisher's exact test 



Number of clusters 



Number of codons 



Number of branches 



Cluster 
size 


UP under/not 
under positive 
selection 


SO under/not 
under positive 
selection 


p-value 
Fisher's 
exact 
test^ 


UP under/not 
under positive 
selection 


SO under/not 
under positive 
selection 


p-value 
Fisher's 
exact 
test^ 


UP under/not 
under positive 
selection 


SO under/not 
under positive 
selection 


p-value 
Fisher's 
exact 
test^ 


4 


1/48 


0/3 


1 


2/22,821 


0/1,467 


1 


9/210 


0/15 


1 


5 


4/102 


0/12 


1 


16/51,017 


0/4,187 


0.62 


51/483 


0/84 


8.78E-04* 


6 


24/474 


0/487 


9.27E-08*** 


66/210,761 


0/191,533 


2.20E-16*** 


184/3,456 


6/4,329 


2.20E-16*** 


7 


15/280 


0/405 


1 .90E-06*** 


43/127,403 


0/163,947 


3.59E-16*** 


136/2,299 


8/4,378 


2.20E-16*** 


8 


4/178 


0/293 


0.02 


24/81,803 


0/110,494 


1 .24E-09*** 


117/1,430 


5/3,744 


2.20E-16*** 


9 


7/108 


0/144 


3.07E-03 


19/49,429 


0/57,346 


4.42E-07*** 


69/1,110 


7/2,085 


2.20E-16*** 


10 


4/73 


0/26 


0.57 


14/36,324 


0/10,298 


0.05 


76/714 


1/425 


6.03E-14*** 



The table contains the results of Fisher's exact test for number of clusters, codons, and branches under positive selection vs. not under positive selection in UP 
and SO clusters for different cluster size categories. 
^Bonferroni corrected for multiple testing, 
p < 2.38E-03, < 4.76E-05. 
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proportion of branches under positive selection among 
LSE gene trees than among single-copy ones. Inferring 
d^lds at the branch level is complementary to analyzing 
d^lds at the codon level Using site models, d^/ds- 
based tests have the greatest power to detect footprints 
of selection in genes involved in co-evolutionary pro- 
cesses as a limited subset of their codons is repeatedly 
subject to positive selection (reviewed by [44]). At the 
branch level, the evolutionary rate is averaged over the 
complete amino acid sequence, making it difficult to de- 
tect a signal when only few sites are targets of positive 
selection. However, an elevated evolutionary rate can be 
detected even if it affects only certain lineages. When 
d^/ds was computed on all the branches of the same 
dataset as for site analyses, we detected a stronger effect 
of positive selection on LSE genes compared to single- 
copy genes. Therefore, we argue that LSE genes are a 
much more important substrate for positive selection to 
act on than single-copy genes. This is - to our knowledge - 
the first genome-scale study to empirically demonstrate 
that LSE genes fuel adaptation in angiosperms. 

Among the vast literature dealing with population gen- 
etic models of duplicated gene evolution, a crucial point is 
whether natural selection plays a role in it [18]. Positive 
selection is expected to act either on the fixation process 
of the duplication itself or at new mutations occurring 
after fixation of the copy in the species (or at both levels 
successively). We found a significantly larger portion of 
LSE genes under positive selection compared to single 
copy ones. Hence, the differentiation between copies for 
LSE genes is driven by changes in proteins, with all the 
functional consequences this may imply. This result corre- 
sponds to predictions made by several models, e.g. the 
"adaptation" model [16,19] or the "adaptive conflict" 
model [20-22]. In these scenarios, the duplication itself is 
not subject to positive selection, and may be fixed by gen- 
etic drift. However, our results may be coherent with a 
third scenario of segregation avoidance [45] where several 
alleles are pre-existing at the ancestral unique locus and 
their retention is advantageous [46,47]. Thus, duplications 
may favor the retention of those alleles if each of them 
gets fixed at one of the different locus resulting from the 
duplication process. In this scenario, positive selection 
does occur on the fixation process itself and the non- 
synonymous mutation observed would have appeared be- 
fore the duplication process. However, it is not possible to 
tell which of these scenarios is more likely in our data, all 
the more that those scenarios can be combined in more 
complex ones. For instance, a first duplication may occur 
allowing a unique gene to escape an adaptive conflict and 
subsequent duplications may occur; generating additional 
copies following - this time - an adaptive scenario. 

Recent progress in angiosperm whole genome sequen- 
cing gave numerous arguments in favor of the positive 



role of polyploidy in the exceptional radiation and diversi- 
fication of angiosperms [48-50]. These hypotheses rely on 
the evolutionary potential caused by genomic shocks such 
as polyploidy. Our study shows that genomic events leading 
to gene duplications at a smaller scale - especially when re- 
curring at a high frequency as it has been described in 
angiosperm genomes [8-10] - appear also fundamental in 
the adaptive dynamic of angiosperms. Recurrent gene du- 
plication/retention offer a mechanism complementary to 
WGD as it may take place all along the evolutionary time 
and can affect a specific subset of gene families. Such fam- 
ilies might be targeted according to their implication in bio- 
logical processes or molecular functions related to the 
ongoing natural selective pressure. This could be reflected 
by the trends we observed in the annotations of the genes 
containing codons under selection: many are involved in 
defense and protein binding is the most common molecu- 
lar function. 

The most abundant domains we found in LSE clusters 
showing signatures of positive selection are F-box and 
LRR domains. F-box proteins (FBP) are one of the largest 
and fastest evolving gene families in land plants [51]. 
When analyzing FBP subfamilies in seven land plant 
species, it was found that 64-67% of duplications are 
species-specific - mostly in angiosperms [52,53]. Expres- 
sion analysis of LSE FBPs showed a fast subfunctionaliza- 
tion on the transcriptional level [52,53]. Finally, it was also 
found that the LSE FBP are less conserved than their 
single-copy counterparts and signatures of positive selec- 
tion are predominantly found in the protein-protein inter- 
action domains of the FBPs [52,53]. An equally large gene 
family comprises of receptor-like kinases (RLK) containing 
LRRs in their extracellular domain [54]. Two main func- 
tions are described for LRR-RLKs: development and 
defense [55]. LRR-RLKs involved in defense are predomin- 
antly found in LSE gene clusters whereas LRR-RLKs in- 
volved in development are mostly found in non-expanded 
groups [55]. It was also discovered that the LRR domains 
are significantly less conserved than the remaining do- 
mains of the LRR-RLK genes [55]. In addition, a study on 
four plant genomes showed that LRR-RLK genes from 
LSE gene clusters show significantly more indication of 
positive selection or relaxed constraint than LRR-RLKs 
from non-expanded groups [55]. Therefore, it is not sur- 
prising that F-box and LRR domains are the most abun- 
dant domains we found in the LSE clusters with codons 
under positive selection. First, proteins containing these 
domains constitute large gene families and are therefore 
likely to show up in our LSE dataset - especially when 
coming from the GreenPhyl Level 2 dataset as it com- 
prises of large trees. Second, several studies showed 
that these proteins/domains are prone to fast evolution 
and adaptation [51,55]. The results shown here give 
valuable insight in the evolution of large gene families 



Fischer et at. BMC Plant Biology 2014, 14:1 51 
http://www.bionnedcentral.conn/1471-2229/14/151 



Page 10 of 15 



and provide the groundwork for more detailed analyses of 
these candidates. 

As automated multi-step genome wide analyses can 
sometimes introduce biases and misinterpretations, we 
took the maximum of precautions at each step. First, we 
chose well-annotated genomes to reduce the bias of mis- 
annotations, although we cannot completely rule them 
out. Annotation errors could lead to an over-estimation of 
the evolutionary rate in duplicated genes [56] . This left us 
with ten angiosperm genomes, even though many com- 
pletely sequenced genomes are now available. Second, as 
^n/'^s -based methods are very sensitive to alignment er- 
rors [57,58], reliable alignment and cleaning tools are 
mandatory. We used PRANK and GUIDANCE to align 
and clean the sequence clusters. Those recent methods 
have been found to produce the most reliable alignments 
for downstream analysis using the PAML software [57,58]. 
Third, we curated the alignments for which we detected 
positive selection manually. As this is a great deal of work 
in large datasets many studies fail to do this. However, we 
argue that this step is crucial to produce reliable results as 
we found around 50% alignment errors and therefore false 
positives. The manual validation of all the positively se- 
lected sites is a major strength of our study. Fourth, the 
power for d^ld^ analysis is related to the number of se- 
quences aligned. In our dataset the difference in sequence 
number was significant between the LSE and the single- 
copy dataset. This could explain, at least partially, the de- 
tection of a higher number of clusters with sites under 
positive selection. By analyzing LSE and single-copy gene 
clusters in each size categories separately we ruled out the 
effect of cluster size and showed that the number of clus- 
ters, codons and branches under positive selection is al- 
ways higher in LSE genes compared to single-copy genes. 
Fifth, we wanted control for a potential "family effect" that 
could result from the fact that some gene families showing 
accelerated evolutionary rate in general, e,g, because of 
their function or base composition, may also be more 
prone to gene duplication and/or retention than others. 
Using subgroups we indeed found an effect: LSE clusters 
from trees containing also a single-copy gene clusters 
show a lower d^ld^ compared to LSE clusters from trees 
without single-copy gene clusters. This means that the 
more a gene family is prone to duplication/retention the 
less probable a single-copy gene cluster will be found. 
Here, we give an argument in favor of the hypothesis that 
the initial level of selective constraint partially conditions 
the frequency of duplication/retention. We detect a family 
effect in different trees but the d^ld^ difference between 
LSE clusters and single-copy gene sets remains significant 
when controlling for this effect by comparing clusters ex- 
tracted from the same gene trees. 

Finally, when analyzing very recent duplicates it is pos- 
sible that the differences between copies are still segregating 



within populations which violates basic assumptions of 
<iN/<is-based tests [59]. Our LSE dataset may include 
genes where differences are still polymorphic which can 
lead to an overestimation of positive selection [59,60]. 
As expected, d^lds is elevated - and most likely over- 
estimated - for low (is values in LSE as well as in single- 
copy gene clusters. The reason for this effect is either 
polymorphism segregating in young copies (mostly the 
case in LSE genes) or a low mutation rate (mostly the 
case in single-copy genes). However, even after removing 
external ("young") LSE branches, the difference between 
single-copy and LSE gene clusters is still significant for d^ 
values above 0.01. This result shows that polymorphism 
and/or a low mutation rate alone cannot explain the dif- 
ferences in d^lds between LSE and single-copy genes. 

Functional analysis is difficult in recently expanded gene 
families because functional or gene expression differences 
are difficult to investigate due to highly similar sequences 
among copies. Additionally, many of these genes are in- 
volved in stress responses [11,12] and therefore specific 
conditions need to be defined a priori. Consequently, mo- 
lecular evolution studies like ours are a good alternative to 
identify candidates in which family expansion is followed 
by an adaptive process to conduct further analyses. An- 
other next step could be to investigate links between our 
results and the duplication mode. By looking at the loca- 
tion of duplicated genes in the genome the duplication 
mode can be assessed. Several studies showed that the du- 
plication mode has an impact on genetic novelty and 
adaptation [61,62]. For example, it was demonstrated that 
TDGs are more often involved in abiotic stress response 
than non-TDGs [10,11,63]. However, a d^ld^ approach is 
not suitable to provide evidence for positive selection on 
the duplication process itself which is the assumption 
under the dosage effect hypothesis [13]. Therefore, we ig- 
nore gene conservation as potential outcome and subse- 
quently probably underestimate the role of adaptation in 
gene duplication/retention. 

Conclusions 

In this paper we conduct one of the largest studies on the 
role of recurrent gene duplication on adaptation in angio- 
sperms so far. Indeed, most of the former studies either 
dealt with candidate families in a broad taxonomical range 
{e,g, [35-37]) or whole genomes for a maximum of four 
plant species {e,g, [11,12]). We searched duplicated genes 
from ten angiosperm genomes for footprints of positive 
selection and our results provide candidates for further 
functional or population genetic studies. In general, we 
used a very conservative approach to detect positive selec- 
tion footprints at LSE genes and might therefore miss 
many true positives. Still, because of the inherent differ- 
ences between LSE and single-copy datasets, our results 
must be interpreted with caution. As the number and 
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quality of sequenced genomes is increasing daily, our 
analysis can be expanded to many more plant species 
in the future. In addition, current efforts in re-sequencing 
numerous genomes from different populations could give 
the opportunity to differentiate between divergence and 
polymorphism and to consequently provide even better 
estimates of quantity and quality of positive selection 
undergone by LSE genes. 

Methods 

Genomes, proteomes, identification of ultraparalog 
clusters and superortholog gene sets 

As analysis of duplicated genes is very sensitive to gene 
annotation errors we chose five well annotated monocot 
and five well annotated dicot genomes (see details on 
our genome selection criteria in Additional file 1): Musa 
acuminata vl.O (banana) [5], Oryza sativa subsp, japon- 
ica v6.0 TEfiltered (Asian rice) [9], Brachypodium dis- 
tachyon vl.O (purple false brome) [64], Zea mays v5.6 
filtered (maize) [65], Sorghum bicolor vl.4 (milo) [66], 
Vitis vinifera vl.O (common grape vine) [4], Arabidopsis 
thaliana vlO.O (thale cress) [8], Populus trichocarpa v2.2 
(black Cottonwood) [67], Glycine max vl.O (soybean) 
[68], and Medicago truncatula v3.5 (barrel medic) [69]. 
The phylogeny of those species is provided in Additional 
file 1. We used the information provided by the Green- 
Phyl v3 database (http://www.greenphyl.org) which uses 
a tree reconciliation approach [70] to identif)^ orthologs 
(genes related by speciation) and paralogs (genes related 
by duplication) in protein trees. This database contains 
protein families' composition and phylogenies for a broad 
variety of green plants whose genomes have been com- 
pletely sequenced [41]. Based on their sequence similarity, 
the GreenPhylDB clusters gene families at different levels 
from the less stringent (large clusters of relatively similar 
sequences at Level 1) to the most stringent (small clusters 
of highly similar sequence at Level 4). First, we extracted 
3,330 protein clusters from Level 1. As large gene families 
(>500 sequences) are not further analyzed in GreenPhyl, 
we extracted 2,238 protein clusters from Level 2 for these 
gene familie. These are two separate datasets and Level 2 
trees are not nested in Level 1 tress (see GreenPhyl home- 
page for details: http://www.greenphyl.org/). 

We extracted ultraparalog clusters (UP - sequences 
only related by duplication) from the GreenPhylDB trees 
on which duplication and speciation events were posi- 
tioned according to the tree reconciliation approach 
cited previously (Figure la). Those clusters represent 
our LSE gene set. As a single-copy gene reference, we 
chose a superortholog gene set (SO - sequences only re- 
lated by speciation). We ignored clusters with less than 
six sequences. The SO clusters were divided into clusters 
coming from the same tree as UP clusters (SOI) or from 
trees exclusively harboring SO clusters (S02). Likewise, 



UP clusters were divided in clusters coming from trees 
containing SO clusters (UPl) or from trees with only UP 
clusters (UP2). Note that when a GreenPhyl tree harbors 
several SO and/or UP clusters, all were extracted. We 
downloaded the corresponding complete CDS of the 
species of interest (links on GreenPhylDB Documenta- 
tion section). In case of alternative spliceforms, the lon- 
gest one is kept in the GreenPhylDB pipeline; it is thus 
the one we downloaded. Most GreenPhyl trees are too 
large and/or too divergent to create reliable nucleotide 
alignments and perform d^I ds-hsised tests on the whole 
tree alignment. This is especially true for the most inter- 
esting cases where trees contain both UP and SO clusters 
(the UPl /SOI dataset). We therefore chose to analyze 
each UP and SO cluster independently. 

In GreenPhyl trees harboring several UP and/or SO 
clusters i.e. in gene families in which gene duplication/ 
retention might be more frequent, one might expect select- 
ive constraint to be different, in particular more relaxed. 
Therefore, some gene families might be overrepresented 
when several clusters from the same tree are analyzed sep- 
arately. To avoid this, an additional step of selection was 
added to the initial dataset as we randomly kept only one 
cluster each time several clusters of UP or several clusters 
of SO were identified from a same tree and removed all 
other clusters from our analysis. Here, we present the re- 
sults for this final sub-dataset. However, we performed our 
analysis on three additional sub-datasets: (i) the whole 
dataset without removing clusters from trees harboring 
more than one cluster, (ii) a dataset which contains clusters 
from GreenPhyl trees with only one UP and/or one SO 
cluster, (iii) a dataset where only clusters from trees har- 
boring more than one cluster were kept. The results for 
these sub-datasets can be found in Additional file 1. 
However, the trends we observe remain, no matter which 
sub-dataset is analyzed (Additional file 1). 

Alignment and cleaning 

We used PRANK+f with codon option [71] for creating 
the alignments and GUIDANCE [42] with the default se- 
quence quality cut-off and a column cut-off of 0.97 to re- 
move problematic sequences and unreliable sites from the 
initial alignments (Figure lb). Those choices were guided 
by several recent studies which found PRANKcodon and 
the PRANKcodon-GUIDANCE combination to produce 
the most reliable alignments for further inference of posi- 
tive selection using codeml [57,58]. Filtering removed all 
sequences from 33 UP clusters, it kept three or less se- 
quences for 91 UP and two SO clusters; all those clusters 
were thus ignored in further analyses as a minimum of 
four sequences was required. For some species (namely Z. 
mays, S. bicolor, G. max, and M. truncatula), the retrieved 
CDS seemed to contain un-translated regions (UTRs) as 
for 126 UP and four SO clusters one or more sequences 
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contained stop codons or incomplete codons {i.e. length 
not divisible by three). Those clusters were also removed 
from the analysis. Additionally, for 18 UP clusters codeml 
failed to run (probably due to insufficient sequences over- 
lap). We retrieved 167 UPl and 167 SOI as well as 1,656 
UP2 and 1,203 S02 clusters. After cleaning, our final data- 
set for codeml analysis comprised 160 UPl, 1,512 UP2, 
167 SOI, and 1,203 S02 clusters for the codeml analysis 
(Table 1). 

As alignment errors can create false positives in the de- 
tection of positive selection footprints, each cluster sug- 
gested to be under positive selection was again checked 
both automatically - using muscle [72] and trimAL [73] 
for creating and cleaning alignments (muscle-trimAL 
method; see Additional file 1) - and manually for align- 
ment errors. We found that our initial alignment and 
cleaning procedure using PRANK [71] and GUIDANCE 
[42] is superior to the muscle-trimAL method. Manual 
curation, however, remains essential to avoid false posi- 
tives (Additional file 1). 

Detecting codons under positive selection 

We used codeml site model implemented in the PAML4 
software [74] to infer positive selection on codons under 
several substitution models. For these analyses, we exten- 
sively relied on the egglib package [75] to implement the 
following pipeline: First, for every alignment the max- 
imum likelihood phylogeny was inferred at the nucleotide 
level using PhyML 3.0 [76] under the GTR-F substitution 
model (Figure Ic). Second, different codeml site models 
were run (Figure Id). The nearly neutral models (Mia and 
M8a) assume codons to evolve either neutrally or under 
purifying selection whereas the positive selection models 
(M2a and M8) assume positive selection acting on some 
codons. Third, likelihood ratio tests (LRTs) were per- 
formed using R [77] to compare nearly neutral and posi- 
tive selection models and hence to detect clusters for 
which models including positive selection are significantly 
more likely than models that do not. We corrected for 
multiple testing using a Bonferroni correction. In clusters 
identified to have evolved under positive selection, Bayes 
empirical Bayes was used to calculate the posterior prob- 
abilities at each codon and detect those under positive se- 
lection {i.e. those with a posterior probability of co > 1 
strictly above 95%). All alignments detected to be under 
positive selection at the codon level were curated manu- 
ally for potential alignment errors. More details on the 
estimated omega for each cluster with codons under 
positive selection, position of every codon under posi- 
tive selection, and results of the LRT for those clusters 
can be found in Additional file 3. All cleaned alignments 
containing codons under positive selection are provided 
in Additional file 4. 



Assessing d/^/ds at branches 

For inferring co on branches, the alignments and the cor- 
responding phylogenies were used as input for mapNH 
[78,79]. Unlike the branch-site model in codeml, this 
method does not require to define branches under selec- 
tion a priori [78]. mapNH performs substitution mapping 
before clustering branches according to their underlying 
substitution processes (Figure Id). The co of each branch 
was then calculated as followed: 

_ nbNS/NSsites 
nbS/Ssites 

using nbNS (number of non-synonymous mutations) 
and nbS (number of synonymous mutations) estimations 
provided by mapNH whereas NSsites (number of non- 
synonymous sites) and Ssites (number of synonymous 
sites) were computed by codeml during the site model 
analysis. We preferably used the NSsites and Ssites pro- 
vided by codeml since they benefit from the maximum 
likelihood estimation of the transition/transversion ratio 
done by codeml for each alignment. Finally, note that co 
was estimated only for clusters with at least one synonym- 
ous and one non-synonymous mutation. After clusters 
with no mutation were removed for the mapNH analysis, 
154 UPl, 1,435 UP2, 167 SOI, and 1,203 S02 clusters 
remained (Table 1). Branches containing no substitutions 
were also removed, leaving us with 1,257 UPl, 14,326 
UP2, 1,807 SOI, and 13,374 S02 branches for the final 
analysis (Table 1). 

Determining effects of time and polymorphism 

SO and UP clusters are different by definition. First, the 
divergence times between sequences are not expected to 
be the same. Specifically, divergence in a given SO cluster 
should range between minimum and maximum diver- 
gence time of the species included in this cluster. Diver- 
gence in UP clusters should range from null (for very 
recent duplications) to the last speciation event. It has 
been shown that d^I ds-hsised tests are strongly influenced 
by ds [59]. To test whether our results are biased due to 
divergence discrepancies between UP and SO, we sorted 
the (0 value of each branch by their synonymous substitu- 
tion rate {ds)^ Second, in the UP dataset some duplications 
could have occurred very recently. It is likely that some 
differences between those young paralogs are still segre- 
gating in populations and should therefore be considered 
as polymorphism instead of divergence. Inferring selection 
using d^/ds in such a scenario has been shown to be in- 
correct [60]. To rule out effects of polymorphism on UP 
clusters, we excluded external branches from the dataset 
and compared the remaining internal branches to the SO 
dataset. To test if co differs significantly between types of 
clusters, we performed a Mann- Whitney test using R 
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[77]. When (o is analyzed according to ds, Mann- Whitney 
tests were performed in a sliding window of 0.01 d^^ The 
calculation was done when a window contained at least 
100 values by group studied. 

Availability of supporting data 

The data sets supporting the results of this article are 
included within the article and its additional files. 
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